Abstract

We present an efficient addition circuit, borrowing techniques from
the classical carry-lookahead arithmetic circuit. Our quantum
carry-lookahead (QCLA) adder accepts two $n$-bit numbers and adds them in
$O(\log n)$ depth using $O(n)$ ancillary qubits. We present both in-place and
out-of-place versions, as well as versions that add modulo $2^n$ and modulo
$2^n - 1$.



Previously, the linear-depth ripple-carry addition circuit has been the
method of choice. Our work reduces the depth of addition from linear to
logarithmic, with only a slight increase in the number of required qubits.
The QCLA adder can be used within current modular multiplication circuits to
reduce substantially the run-time of Shor's algorithm.



1 Introduction

With the advent of Shor's algorithms for prime factorization and the discrete
logarithm problem, it is necessary to design efficient quantum arithmetic
circuits. Previous quantum addition circuits include the quantum ripple-carry
adder of Vedral, Barenco, and Ekert [7], which has recently been improved [1],
and the transform adder [2]. Both of these approaches have depth linear in the
number of input bits. We present a new adder whose depth is logarithmic in
the number of input bits. The circuit size, and the number of ancillary qubits
needed, are linear in the number of input bits.

Our technique is derived from classical methods that run in time logarithmic
in the number of input bits. The classical carry-lookahead (CLA) adder
[6, 5, 8] computes the carry bits in a tree-like structure, yielding a
logarithmic-depth circuit. We can exploit this same structure to design a
quantum CLA (QCLA) circuit to add two $n$-bit numbers in $O(\log n)$ depth.
The QCLA adder works in $\mathbb{Z}$, and can be modified to add (mod $2^n$)
or (mod $2^n - 1$).

The theory of carry-lookahead addition has been known for fifty years [6],
and has appeared in circuit design textbooks [4, pp. 158-161], [3, pp. 84-91].
Why, then, is this paper necessary? What are the challenges of adapting the
CLA technique to a quantum circuit? There are several constraints we have to
consider:



• Reversibility: We are limited to operations which do not destroy
  information.

• Erasure: If we use scratch space, we must explicitly erase it. We will not
  be able to take advantage of quantum interference if our circuit leaves
  extra information in scratch registers.

• Space-boundedness: We wish to minimize the use of ancillae.

• Bounded fan-out: At a given time, we can only use a wire as an input to
  a single quantum gate. To use multiple copies of a bit, we must explicitly
  perform a fan-out operation, which increases the size and depth of the
  circuit, and may increase the necessary space.

In Section 2.2, we discuss the classical theory of carry-lookahead addition;
we then adapt this theory to the quantum setting in Sections 3 and 4. Next, in
Section 5, we discuss various modified versions of the addition problem: how to
add (mod $2^n$) or (mod $2^n - 1$), how to compare, and how to take an incoming
carry bit as input. The complexities of the various circuits are summarized in
Table 1. Finally, we close with some thoughts on future work.

2 Preliminaries 

We first describe the notation used throughout this paper. We then discuss
the classical carry-lookahead adder.

2.1 Notation 

We write the binary expansion of a number $r$ as $r = r_{n-1} r_{n-2} \cdots r_0$,
where $r_0$ is the low-order bit.

We generally represent negative numbers using two's-complement arithmetic,
in which the bitwise complement $r'$ is equal to $-r - 1$. In Section 5.5, we
consider one's-complement arithmetic, in which $r' = -r$. Note that, in this
latter scheme, the all-zeros bit string and all-ones bit string both represent
zero, so we have to be careful when designing reversible one's-complement
arithmetic circuits.

In our circuit diagrams, time runs from left to right. We use the standard
notation for quantum circuit operations: $\oplus$ for negation, and • for a
control.

In this paper, our circuits are composed of NOT gates (also called negations),
controlled-NOT (CNOT) gates, and Toffoli gates. A controlled-NOT gate
has a single control qubit connected to a NOT gate on the target qubit. A Toffoli
gate has two qubits controlling the application of a NOT gate to the target qubit.
Hence, all of our circuits are classical reversible circuits.

We will refer to the two inputs to our addition circuit as $a$ and $b$. Our goal
is to compute the sum $s$, either in place (on top of $b$) or out of place. We
compute $s$ by first finding $c$, the carry bits, such that
$s = a \oplus b \oplus c$. (If one computes the sum using standard school-book
addition, then $c$ is the sequence of carries.)

We let $w(n)$ denote the number of ones in the binary expansion of $n$. We
observe that

$$n - w(n) = \sum_{i=1}^{\infty} \left\lfloor \frac{n}{2^i} \right\rfloor. \tag{1}$$

We denote $\log_2$ simply by $\log$.
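Identity (1) is used repeatedly in the gate and ancilla counts below, so a quick numerical check may be reassuring. The following is a minimal sketch; the helper names `w` and `halving_sum` are illustrative and do not come from the paper.

```python
def w(n: int) -> int:
    """Number of ones in the binary expansion of n."""
    return bin(n).count("1")


def halving_sum(n: int) -> int:
    """Sum of floor(n / 2**i) for i = 1, 2, ...

    The terms are zero once 2**i exceeds n, so the loop stops there.
    """
    total, i = 0, 1
    while (n >> i) > 0:
        total += n >> i  # n >> i equals floor(n / 2**i)
        i += 1
    return total


# Spot-check identity (1) on a range of small values.
for n in range(1, 65):
    assert n - w(n) == halving_sum(n)
```

For example, with $n = 10$ the sum is $5 + 2 + 1 = 8$, and $10 - w(10) = 10 - 2 = 8$.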



2.2 Classical carry-lookahead addition 

In this section, we describe the classical carry-lookahead addition circuit and
our motivation for using a carry-lookahead structure. The CLA adder [5] sums two
$n$-bit numbers in $O(\log n)$ depth. In this arithmetic circuit, partial
information about the incoming carry bits is exploited to avoid a linear-time
ripple-carry computation. The carry bit string can be computed using a tree
structure to greatly reduce the depth of the computation.

The key ingredient is the carry status on an interval, denoted $C[i,j]$. This
status can take one of three values: k represents "kill," g represents
"generate," and p represents "propagate." We begin with a discussion of the
carry status $C[i,i+1]$.

Suppose we are adding $a$ and $b$, and we have computed the carry bit $c_i$. The
next carry bit $c_{i+1}$ is the majority function MAJ$(a_i, b_i, c_i)$. The base
case for this process, $c_0$, is assumed to be 0; see Section 5.2 for discussion
of the more general problem where $c_0$ is an input bit.

When $a_i = b_i$, we can determine the carry bit $c_{i+1}$ without knowing $c_i$.
Specifically, if $a_i = b_i = 0$, then the outgoing carry bit $c_{i+1}$ is
automatically "killed" and we set $c_{i+1} = 0$; we say that $C[i,i+1] = $ k.
Similarly, if $a_i = b_i = 1$, then a carry bit is "generated" and $c_{i+1} = 1$
with carry status $C[i,i+1] = $ g.

If $a_i \ne b_i$, then we cannot determine $c_{i+1}$ without knowing $c_i$. In
this case, the carry $c_i$ is "propagated" and we set $c_{i+1} = c_i$ with carry
status $C[i,i+1] = $ p. Figure 1 summarizes this computation of the carry status.

Given $C[i-1,i]$ and $C[i,i+1]$, we can compute the carry status $C[i-1,i+1]$.
The calculation is shown in Figure 2; we use $\odot$ to denote the carry status
operator. If $C[i-1,i+1] = $ k, then either a carry is killed at position $i$, or
it would be propagated at position $i$ but has been killed at position $i-1$.
Either way, if $C[i-1,i+1] = $ k, we know that $c_{i+1} = 0$. Similarly, if
$C[i-1,i+1] = $ g, we know that $c_{i+1} = 1$. If $C[i-1,i+1] = $ p, we conclude
that $c_{i+1} = c_{i-1}$.

Figure 1: The carry status of $a_i$ and $b_i$.

Figure 2: The carry status assignments $C[i-1,i+1]$ given the previous and
current carry status values.



The carry status operator $\odot$ shown in Figure 2 allows us to merge
intervals: for any $k$ satisfying $i < k < j$,

$$C[i,j] = C[i,k] \odot C[k,j].$$

The choice of $k$ does not affect the answer, since $\odot$ is associative. By
successively doubling the sizes of intervals, we can use this approach to
compute $C[i,j]$ for any $i,j$ in logarithmic depth.
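The merging rule can be made concrete with a small sketch. The function names `status` and `merge` are illustrative, and the merge table is inferred from the prose description above (a high-side status of p defers to the low side; otherwise the high side decides), since the figure itself is not reproduced here.

```python
from itertools import product


def status(a_bit: int, b_bit: int) -> str:
    """Carry status C[i, i+1] for one bit position: 'k', 'g', or 'p'."""
    if a_bit == b_bit:
        return "g" if a_bit == 1 else "k"
    return "p"


def merge(low: str, high: str) -> str:
    """Carry status of [i, j] from C[i, k] (low) and C[k, j] (high)."""
    # A propagate on the high interval passes the low interval's status
    # through; a kill or generate on the high interval overrides it.
    return low if high == "p" else high


# The operator is associative, so the split point k does not matter.
for x, y, z in product("kgp", repeat=3):
    assert merge(merge(x, y), z) == merge(x, merge(y, z))
```

Associativity is what permits the tree-shaped evaluation order used in the rest of the construction.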

We now describe the computation of the carry bits in detail. Since $C[i,j]$
can take three values, we must specify an encoding of $C[i,j]$ in bits. We define
$p[i,j]$ to be 1 when $C[i,j] = $ p, and we define $g[i,j]$ to be 1 when
$C[i,j] = $ g. The relationship between $C[i,j]$, $p[i,j]$, and $g[i,j]$ is
depicted in Figure 3. Note that, in particular, we never have
$p[i,j] = g[i,j] = 1$.

Figure 3: The carry status $C[i,j]$ encoded in two bits $p[i,j]$ and $g[i,j]$.



For any $i, j$, $p[i,j]$ is 1 if a carry propagates from bit position $i$ to bit
position $j$, and 0 otherwise. Note that this occurs if and only if
$a_\ell \oplus b_\ell = 1$ whenever $i \le \ell < j$. For any $k$ between $i$ and
$j$, a carry bit is propagated from bit $i$ to bit $j$ if a carry bit is
propagated from bit $i$ to bit $k$, and then also propagated from bit $k$ to
bit $j$. Thus, the computation of the propagate bits, for any $i < k < j$, is

$$p[i,j] = p[i,k] \wedge p[k,j]. \tag{2}$$



Next, we consider $g[i,j]$. This quantity is 1 when a carry is generated
between bit positions $i$ and $j$. The computation of the generate bits, for
$i < k < j$, is

$$g[i,j] = g[k,j] \vee (g[i,k] \wedge p[k,j]) = g[k,j] \oplus (g[i,k] \wedge p[k,j]). \tag{3}$$

That is, either a carry bit is generated between bits $k$ and $j$, or a carry bit
is generated between bits $i$ and $k$ and then propagated from bit $k$ to bit $j$.
The second expression follows from the observation that $g[k,j]$ and $p[k,j]$
cannot both be equal to 1.

For all $j > 0$, $p[0,j]$ is 0, and $g[0,j]$ is the carry bit $c_j$. By
successively doubling the sizes of the intervals under consideration, we can
compute all carry bits in logarithmic depth.
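As an illustration of rules (2) and (3), the sketch below computes every carry bit by recursively splitting intervals and compares the result against a ripple computation using the majority function. It is a classical, irreversible illustration only; the function names and the midpoint choice of $k$ are assumptions made for the example, not the paper's circuit.

```python
from functools import lru_cache


def carries_by_lookahead(a: int, b: int, n: int) -> list:
    """Return [c_0, ..., c_n] for a + b using interval rules (2) and (3)."""
    abit = [(a >> i) & 1 for i in range(n)]
    bbit = [(b >> i) & 1 for i in range(n)]

    @lru_cache(maxsize=None)
    def pg(i: int, j: int) -> tuple:
        """(propagate, generate) bits for the interval [i, j]."""
        if j == i + 1:
            return abit[i] ^ bbit[i], abit[i] & bbit[i]
        k = (i + j) // 2  # any split point i < k < j works
        p_lo, g_lo = pg(i, k)
        p_hi, g_hi = pg(k, j)
        return p_lo & p_hi, g_hi ^ (g_lo & p_hi)  # rules (2) and (3)

    # c_0 = 0 by assumption; c_j is the generate bit of [0, j].
    return [0] + [pg(0, j)[1] for j in range(1, n + 1)]


def carries_by_ripple(a: int, b: int, n: int) -> list:
    """Reference carries: c_{i+1} = MAJ(a_i, b_i, c_i) with c_0 = 0."""
    c = [0]
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        c.append((ai & bi) | (ai & c[i]) | (bi & c[i]))
    return c
```

The recursion depth is logarithmic in $n$, which mirrors the doubling argument in the text.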

3 Reversible computation of carry status 

We are now ready to build a quantum adder using the CLA technique. We first
explain how we can compute the carry status reversibly.

The circuit of this section has two input arrays, each of length $n$: $P_0$,
initialized to $P_0[i] = p[i,i+1]$, and $G$, initialized to $G[i] = g[i-1,i]$.
Note that the array $P_0$ is 0-based, but the array $G$ is 1-based. We also use
$n - w(n) - \lfloor \log n \rfloor$ ancillary bits, initialized to zero.

At the end of the computation, we want $G[i] = g[0,i] = c_i$. We also need to
erase our scratch work: we must ensure that, when we're done,
$P_0[i] = p[i,i+1]$ and the ancillary bits are reset to zero.

We will have roughly $\lfloor \log n \rfloor$ rounds each of four different types:

1. P-rounds: Compute $p[i,j]$ values into the ancillary space.

2. G-rounds: Set $G[j] = g[i,j]$; for each $j$, we choose a particular $i$ value.^1

3. C-rounds: Set $G[j] = c_j$.

4. $P^{-1}$-rounds: Erase the work done in the P-rounds.

We first describe the sequence of gates, and then we compute the circuit depth.

In P-round $t$, we compute all p-values of the form $p[i,j]$ where $i = 2^t m$
and $j = i + 2^t$. We refer to these values as $P_t[m]$, for
$1 \le m < \lfloor n/2^t \rfloor$. We store these $\lfloor n/2^t \rfloor - 1$
values in our ancillary space. By (1), the total space needed for all of the
P-rounds is $n - w(n) - \lfloor \log n \rfloor$ bits. We do not need to compute
values of the form $p[0, 2^t]$, since no carry is generated at 0; in particular,
when $t = \lfloor \log n \rfloor$, no computation is done.

We compute $p[i,j]$ using (2), with $k = 2^t m + 2^{t-1}$. Note that both
$p[i,k]$ and $p[k,j]$ were computed in P-round $(t-1)$, so we can write $p[i,j]$
to the appropriate location using one Toffoli gate. The total number of gates is
thus $n - w(n) - \lfloor \log n \rfloor$.

^1 We have $i = (j - 1) \wedge j$, where $-$ denotes subtraction in $\mathbb{Z}$,
and $\wedge$ denotes bitwise AND.

In G-round $t$, we compute all g-values of the form $g[i,j]$ where $i = 2^t m$
and $j = i + 2^t$. We store this value in the location $G[j]$. We use (3), with
$k = 2^t m + 2^{t-1}$. Since $g[k,j]$ is already in location $G[j]$ after
G-round $(t-1)$, we can do this computation with a single Toffoli gate,
combining $g[i,k]$ (computed in G-round $(t-1)$) and $p[k,j]$ (computed in
P-round $(t-1)$). The total number of gates is $n - w(n)$.

In C-round $t$, we compute all g-values of the form $g[0,j]$ with
$j = 2^t m + 2^{t-1}$. We begin with the maximum $t$ for which some $j$ exists,
$t = \lfloor \log \frac{2n}{3} \rfloor = 1 + \lfloor \log \frac{n}{3} \rfloor$,
and work our way down to $t = 1$. We use (3) with $k = 2^t m$. Since $g[k,j]$ is
already in location $G[j]$, we again need just one Toffoli gate: we require
$p[k,j]$ (computed in P-round $(t-1)$) and $g[0,k]$ (computed in C-round
$(t+1)$ or earlier). The total number of gates is
$n - \lfloor \log n \rfloor - 1$.

Finally, in the $P^{-1}$-rounds, we simply repeat the same Toffolis as in the
P-rounds, in reverse order.

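In summary, we must perform the following steps:

1. P-rounds. For $t = 1$ to $\lfloor \log n \rfloor - 1$: for $1 \le m < \lfloor n/2^t \rfloor$:

   $P_t[m] \oplus= P_{t-1}[2m]\,P_{t-1}[2m+1]$.

2. G-rounds. For $t = 1$ to $\lfloor \log n \rfloor$: for $0 \le m < \lfloor n/2^t \rfloor$:

   $G[2^t m + 2^t] \oplus= G[2^t m + 2^{t-1}]\,P_{t-1}[2m+1]$.

3. C-rounds. For $t = \lfloor \log \frac{2n}{3} \rfloor$ down to 1: for $1 \le m \le \lfloor (n - 2^{t-1})/2^t \rfloor$:

   $G[2^t m + 2^{t-1}] \oplus= G[2^t m]\,P_{t-1}[2m]$.

4. $P^{-1}$-rounds. For $t = \lfloor \log n \rfloor - 1$ down to 1: for $1 \le m < \lfloor n/2^t \rfloor$:

   $P_t[m] \oplus= P_{t-1}[2m]\,P_{t-1}[2m+1]$.

The circuit consists of

$$4n - 3w(n) - 3\lfloor \log n \rfloor - 1 \tag{4}$$

Toffoli gates.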
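The four steps can be simulated classically, since every gate is a Toffoli acting on bits. The following is a minimal sketch, assuming the loop bounds as reconstructed above; the function name `carry_rounds`, the dictionary `P`, and the padding entries at index 0 are illustrative choices, and the simulation ignores time-slice scheduling.

```python
def w(n: int) -> int:
    """Number of ones in the binary expansion of n."""
    return bin(n).count("1")


def carry_rounds(p0: list, g: list, n: int):
    """Apply the P-, G-, C-, and inverse P-rounds in place.

    p0[i] = p[i, i+1] for 0 <= i < n (p0[0] is never read);
    g[i]  = g[i-1, i] for 1 <= i <= n (g[0] is unused padding).
    Returns (P, toffoli_count), where P[t] holds the ancillae of round t.
    """
    log_n = n.bit_length() - 1                 # floor(log2 n)
    P = {0: p0}
    for t in range(1, log_n):                  # ancillae start at zero
        P[t] = [0] * (n >> t)                  # index 0 is padding
    toffolis = 0

    def p_round(t: int) -> None:
        nonlocal toffolis
        for m in range(1, n >> t):
            P[t][m] ^= P[t - 1][2 * m] & P[t - 1][2 * m + 1]
            toffolis += 1

    for t in range(1, log_n):                  # 1. P-rounds
        p_round(t)
    for t in range(1, log_n + 1):              # 2. G-rounds
        half = 1 << (t - 1)
        for m in range(n >> t):
            g[(m << t) + 2 * half] ^= g[(m << t) + half] & P[t - 1][2 * m + 1]
            toffolis += 1
    t_max = 0                                  # largest t with 3 * 2**(t-1) <= n
    while 3 * (1 << t_max) <= n:
        t_max += 1
    for t in range(t_max, 0, -1):              # 3. C-rounds
        half = 1 << (t - 1)
        for m in range(1, ((n - half) >> t) + 1):
            g[(m << t) + half] ^= g[m << t] & P[t - 1][2 * m]
            toffolis += 1
    for t in range(log_n - 1, 0, -1):          # 4. inverse P-rounds
        p_round(t)
    return P, toffolis
```

After the call, `g[i]` should hold the carry $c_i$, every ancilla list should be back to zero, and the returned count can be compared against expression (4).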

It would seem that the circuit described above would require roughly
$4 \log n$ time-slices. However, we can overlap some of the computation.

We start with P-round 1, which uses the arrays $P_0$ and $P_1$. Then, P-round
2 uses the arrays $P_1$ and $P_2$. Note that G-round 1 uses the arrays $G$ and
$P_0$; hence, we can run G-round 1 in the same time-slice as P-round 2. In
general, we can run P-round $(t+1)$ and G-round $t$ in parallel.

Similarly, once we have run C-round $t$, we are done using $P_{t-1}$. While we
run C-round $(t-1)$, we can run $P^{-1}$-round $t$, which uses $P_{t-1}$ to
erase $P_t$. We run C-round 1 in parallel with $P^{-1}$-round 2; we then need
one additional time-slice to run $P^{-1}$-round 1.

So, the circuit has a depth of

$$\lfloor \log n \rfloor + \left\lfloor \log \frac{n}{3} \right\rfloor + 3. \tag{5}$$

For $n < 3$, expression (5) overcounts the depth, since there are no P-rounds.

4 The complete quantum addition circuit 

We are now ready to describe our quantum carry-lookahead addition circuit.
We first discuss the out-of-place version in Section 4.1, and then the in-place
version in Section 4.2.

The out-of-place version produces $n + 1$ bits of output, and uses
$n - w(n) - \lfloor \log n \rfloor$ ancillae. The depth is $2 \log n + O(1)$ and
the size is $8n - O(\log n)$ gates.

The in-place version produces 1 bit of output, and uses
$2n - w(n) - \lfloor \log n \rfloor - 1$ ancillae. The depth is
$4 \log n + O(1)$ and the size is $16n - O(\log n)$ gates.

Table 1 summarizes the complexities of these two adders, as well as the
variants discussed in Section 5.

4.1 Addition out of place 

We would like to add two $n$-bit numbers, $a$ and $b$, stored in arrays $A$ and
$B$. We need $n + 1$ bits for the output, denoted by $Z$, and
$n - w(n) - \lfloor \log n \rfloor$ ancillary bits, denoted by $X$. We assume
that $Z$ and $X$ are initialized to zero. In the end, we want $Z$ to contain the
quantity $s = a + b$.

The key relation is that the sum $s$ is equal to $a \oplus b \oplus c$, where $c$
is the carry string. Hence, the key step in our algorithm is to compute $c$,
using the technique of the previous section. We compute the carry string $c_1$
through $c_n$ into the bits $Z[1]$ through $Z[n]$.

The out-of-place QCLA adder proceeds as follows:

1. For $0 \le i < n$, $Z[i+1] \oplus= A[i]B[i]$. This sets $Z[i+1] = g[i,i+1]$.

2. For $1 \le i < n$, $B[i] \oplus= A[i]$. This sets $B[i] = p[i,i+1]$ for
   $i > 0$, which is what we need to run our addition circuit.

3. Run the circuit of Section 3, using $X$ as ancillary space. Upon completion,
   $Z[i] = c_i$ for $i \ge 1$.

4. For $0 \le i < n$, $Z[i] \oplus= B[i]$. Now, for $i > 0$,
   $Z[i] = a_i \oplus b_i \oplus c_i = s_i$. For $i = 0$, we have $Z[i] = b_i$.

5. Set $Z[0] \oplus= A[0]$. For $1 \le i < n$, $B[i] \oplus= A[i]$. This fixes
   $Z[0]$, and resets $B$ to its initial value.

Figure 4: Out-of-place QCLA adder for 10 bits. P-rounds and $P^{-1}$-rounds are
shown in blue. G-rounds are red, and C-rounds are green.



Aside from Step 3, each step occurs in a single time-slice. So, by (5), the
overall depth of the circuit is

$$\lfloor \log n \rfloor + \left\lfloor \log \frac{n}{3} \right\rfloor + 7,$$

where three of the time-slices contain controlled-NOTs and the rest contain
Toffolis. For $n < 3$, the depth is slightly lower.

By (4), the circuit contains

$$5n - 3w(n) - 3\lfloor \log n \rfloor - 1$$

Toffoli gates and $3n - 1$ controlled-NOTs.

The circuit for $n = 10$ is depicted in Figure 4.
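To see how the five steps of the out-of-place adder fit together, the sketch below runs them on classical bit arrays. Note the assumption: the Section 3 circuit is replaced by a hypothetical stand-in, `carry_stage`, which produces the same carries with a simple linear ripple; only the surrounding bookkeeping of Steps 1, 2, 4, and 5 is being illustrated.

```python
def carry_stage(B: list, Z: list, n: int) -> None:
    """Stand-in for the Section 3 circuit (NOT logarithmic depth).

    On entry Z[i] = g[i-1, i] and B[i] = p[i, i+1] for i >= 1;
    on exit Z[i] = c_i for 1 <= i <= n.
    """
    for i in range(1, n):
        Z[i + 1] ^= B[i] & Z[i]  # c_{i+1} = g[i,i+1] XOR (p[i,i+1] AND c_i)


def add_out_of_place(a: int, b: int, n: int):
    """Return (contents of Z, contents of B) after Steps 1-5."""
    A = [(a >> i) & 1 for i in range(n)]
    B = [(b >> i) & 1 for i in range(n)]
    Z = [0] * (n + 1)
    for i in range(n):           # Step 1: generate bits via Toffolis
        Z[i + 1] ^= A[i] & B[i]
    for i in range(1, n):        # Step 2: propagate bits via CNOTs
        B[i] ^= A[i]
    carry_stage(B, Z, n)         # Step 3: carries into Z[1..n]
    for i in range(n):           # Step 4: fold in a XOR b
        Z[i] ^= B[i]
    Z[0] ^= A[0]                 # Step 5: fix Z[0] and restore B
    for i in range(1, n):
        B[i] ^= A[i]
    to_int = lambda bits: sum(bit << i for i, bit in enumerate(bits))
    return to_int(Z), to_int(B)
```

The second return value lets a caller confirm that $B$ is restored, which is the erasure requirement from the introduction.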

4.2 Addition in place 

For the in-place circuit, we begin the same way as above: we compute the carry
string $c$ into $n - 1$ ancillary bits (plus one output bit for the high bit).
The total ancillary space required is $2n - w(n) - \lfloor \log n \rfloor - 1$.
We then write the low $n$ bits of the sum on top of $b$. The key new step is the
erasure of the low $n - 1$ bits of the carry string $c$.

Recall from Section 2.1 that we are using two's-complement arithmetic:

$$r' + r = -1 \pmod{2^n}.$$

So, writing $s = a + b$,

$$a + s' = a - a - b - 1 = b' \pmod{2^n}.$$

Let $d$ be the carry string generated by $a$ and $s'$. We have

$$a \oplus s' \oplus d = b'$$
$$a \oplus (a \oplus b \oplus c) \oplus (-1) \oplus d = b \oplus (-1),$$

which simplifies to $c \oplus d = 0$.

So the carry string $d$, generated by adding $a$ and $s'$, is simply $c$. After
we compute $s$, we can complement it, and then run the circuit of Section 3 in
reverse to erase $c$.
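The erasure argument rests on the claim that adding $a$ to $s'$ generates the same low-order carries as adding $a$ to $b$. A minimal sketch of a brute-force check follows; `carry_string` and `erasure_identity_holds` are illustrative names, and only the carries $c_0, \ldots, c_{n-1}$ into the low $n$ positions are compared.

```python
def carry_string(x: int, y: int, n: int) -> int:
    """Carries c_0..c_{n-1} of x + y packed into an integer (c_0 = 0)."""
    c, packed = 0, 0
    for i in range(n):
        packed |= c << i
        xi, yi = (x >> i) & 1, (y >> i) & 1
        c = (xi & yi) | (xi & c) | (yi & c)  # majority function
    return packed


def erasure_identity_holds(a: int, b: int, n: int) -> bool:
    """Check that a + s' generates the same low carries as a + b."""
    mask = (1 << n) - 1
    s = (a + b) & mask          # low n bits of the sum
    s_comp = ~s & mask          # bitwise complement s'
    return carry_string(a, s_comp, n) == carry_string(a, b, n)
```

The high carry out of position $n - 1$ is deliberately excluded, which matches the remark that the high carry bit should not be erased.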

The in-place QCLA adder proceeds as follows. We denote the $n - 1$ ancillae
which store the carry string as $Z[1], \ldots, Z[n-1]$, and the remaining
ancillae as $X$. The output bit is labeled $Z[n]$.

1. For $0 \le i < n$, $Z[i+1] \oplus= A[i]B[i]$. This sets $Z[i+1] = g[i,i+1]$.

2. For $0 \le i < n$, $B[i] \oplus= A[i]$. This sets $B[i] = p[i,i+1]$ for
   $i > 0$. Also, $B[0] = s_0$.

3. Run the circuit of Section 3, using $X$ as ancillary space. Upon completion,
   $Z[i] = c_i$ for $i \ge 1$.

4. For $1 \le i < n$, $B[i] \oplus= Z[i]$. Now $B[i] = s_i$.

5. For $0 \le i < n - 1$, negate $B[i]$. Now $B$ contains $s'$.

6. For $1 \le i < n - 1$, $B[i] \oplus= A[i]$.

7. Run the circuit of Section 3 in reverse.^2 Upon completion,
   $Z[i+1] = a_i s'_i$ for $0 \le i < n - 1$, and $B[i] = a_i \oplus s'_i$ for
   $1 \le i < n$.

8. For $1 \le i < n - 1$, $B[i] \oplus= A[i]$.

9. For $0 \le i < n - 1$, $Z[i+1] \oplus= A[i]B[i]$.

10. For $0 \le i < n - 1$, negate $B[i]$.

Figure 5: In-place QCLA adder for 10 bits. P-rounds and $P^{-1}$-rounds are
shown in blue. G-rounds are red, and C-rounds are green.

Each step other than 3 and 7 has depth 1. By (5), the overall depth is

$$\lfloor \log n \rfloor + \lfloor \log(n-1) \rfloor + \left\lfloor \log \frac{n}{3} \right\rfloor + \left\lfloor \log \frac{n-1}{3} \right\rfloor + 14,$$

where two of the time-slices contain negations, four contain controlled-NOTs,
and the rest contain Toffolis. For some values of $n < 6$, the depth is slightly
lower.

By (4), the circuit contains

$$10n - 3w(n) - 3w(n-1) - 3\lfloor \log n \rfloor - 3\lfloor \log(n-1) \rfloor - 7$$

Toffoli gates, $4n - 5$ controlled-NOTs, and $2n - 2$ negations.

In Figure 5, we show a sample in-place QCLA adder for the case $n = 10$.

5 Extensions 

We now discuss various modified versions of the circuit. The simplest is one
which adds (mod $2^n$); we simply skip the computation of the high bit. With
slightly more work, we can add (mod $2^n - 1$).

There are other constructions which use the log-depth adder as a subroutine.
For these, it is useful to allow the adder to take one additional bit, an
incoming carry. In this case, we wish to compute $a + b + y$, where $y$ is
either 0 or 1.

Another useful subroutine in an addition (or modular addition) circuit is
comparison: Is $a \ge b$? Equivalently, is the high bit of $a - b$ zero? We
discuss how one can use the log-depth adder to subtract, and we show that a
comparator is of comparable complexity to an out-of-place adder.

^2 In Step 7, we actually reverse the $(n-1)$-bit adder, since we should not
erase the high carry bit. See Section 5.1 for more discussion.



5.1 Addition (mod $2^n$)

It is straightforward to add (mod $2^n$); we simply do not compute the high bit
of the sum. The only question is: what are the exact savings, in depth and
circuit size?

Since we do not need to compute $c_n$, we can simply run the circuit of
Section 3 on the low-order $n - 1$ bits of $a$ and $b$. For the out-of-place
adder, this circuit leaves $c_{n-1}$ in $Z[n-1]$, so we also need to apply the
gates $Z[n-1] \oplus= a_{n-1}$ and $Z[n-1] \oplus= b_{n-1}$; we therefore add
two additional controlled-NOTs. For $n > 1$, this does not increase the depth.

So, the out-of-place (mod $2^n$) adder produces $n$ output bits, and uses
$(n-1) - w(n-1) - \lfloor \log(n-1) \rfloor$ ancillae. The depth is
$\lfloor \log(n-1) \rfloor + \lfloor \log \frac{n-1}{3} \rfloor + 7$ when
$n > 4$, and the circuit consists of
$5n - 3w(n-1) - 3\lfloor \log(n-1) \rfloor - 6$ Toffolis and $3n - 2$
controlled-NOTs.

For the in-place adder, we follow the steps in Section 4.2. However, in Step 1,
our loop now stops at $i = n - 2$, and, in Step 3, we run the $(n-1)$-bit adder.

Thus, the in-place (mod $2^n$) adder uses
$2n - 2 - w(n-1) - \lfloor \log(n-1) \rfloor$ ancillae. The depth is
$2\lfloor \log(n-1) \rfloor + 2\lfloor \log \frac{n-1}{3} \rfloor + 14$ when
$n > 5$, and the circuit size is
$10n - 6w(n-1) - 6\lfloor \log(n-1) \rfloor - 12$ Toffolis, $4n - 5$
controlled-NOTs, and $2n - 2$ negations.

5.2 Addition with incoming carry 

Suppose we want our adder to take $2n + 1$ bits of input: $a$, $b$, and a single
bit $y$, representing an incoming carry. This is useful in various hybrid
addition circuits, where we break the problem up into smaller pieces.

We can accomplish this by adding the $(n+1)$-bit numbers $2a + y$ and $2b + y$,
whose sum is $2(a + b + y)$. So, the cost is roughly the same as that of an
$(n+1)$-bit add. However, we use fewer operations on the low-order bit: we
simply start with $c_1 = y$. The additional input bit $y$ replaces one output
bit for the out-of-place adder, and one ancillary bit for the in-place adder.

For the out-of-place adder, we save one Toffoli and two controlled-NOTs over
the usual $(n+1)$-bit adder. For the in-place adder, we save two Toffolis, one
controlled-NOT, and two negations.

The same analysis applies to the (mod $2^n$) adder of Section 5.1.
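The arithmetic behind the incoming-carry trick is short enough to check directly. The sketch below uses ordinary integer addition as a stand-in for the $(n+1)$-bit adder; the function name `add_with_carry_in` is illustrative.

```python
def add_with_carry_in(a: int, b: int, y: int) -> int:
    """Compute a + b + y using a single addition of widened operands."""
    assert y in (0, 1)
    wide = (2 * a + y) + (2 * b + y)  # equals 2 * (a + b + y)
    assert wide & 1 == 0              # low bit of the wide sum is always 0
    return wide >> 1                  # drop the low bit to recover a + b + y
```

Both widened operands have low bit $y$, so the carry out of position 0 is $y \wedge y = y$, which is why the circuit can start directly from $c_1 = y$.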

5.3 Subtraction 

It is straightforward to use our circuit to compute $a - b$. First, complement
all bits of $a$. Then, add as usual; we compute $a' + b$. At the end, complement
all bits of $a$ and all output bits. The result, assuming two's-complement
arithmetic, is then

$$(a' + b)' = (-a - 1 + b)' = a - b.$$

A similar argument holds for one's-complement arithmetic.

Hence, the cost of subtraction is essentially the same as the cost of addition.
We add two time-slices, both consisting only of negations.
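A minimal sketch of the complement-add-complement identity, working modulo $2^n$ with Python integers; `subtract_via_complement` is an illustrative name and the built-in `+` stands in for the adder.

```python
def subtract_via_complement(a: int, b: int, n: int) -> int:
    """Return (a - b) mod 2**n using only complements and one addition."""
    mask = (1 << n) - 1
    a_comp = ~a & mask           # a' = -a - 1 (mod 2**n)
    total = (a_comp + b) & mask  # a' + b
    return ~total & mask         # (a' + b)' = a - b (mod 2**n)
```

For instance, with $n = 4$, $a = 5$, and $b = 7$: $a' = 10$, $a' + b = 17 \equiv 1$, and the complement of 1 is 14, which is $-2 \bmod 16$.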
5.4 Comparison

Suppose we wish to compare two numbers $a$ and $b$. We compute the high bit
of $a - b$. As in the subtractor, we first complement $a$. We then run the QCLA
adder forward until we have found the high bit of $a' + b$, and then we reverse
the preceding computation. This gives us a QCLA comparator.

If $n = 2^k$ for some $k$, then the above idea works well; we find the high bit
at the end of the G-rounds, halfway through the out-of-place add. However,
for $n = 2^k - 1$, we do not compute the high bit until after we're done with
the C-rounds. If we just use this simple approach, the depth of our circuit
turns out to be $2\lfloor \log n \rfloor + 2w(n) + 5$. We would prefer to design
a comparator which has depth $2 \log n + O(1)$.

So, we have to be more careful. Let $k = \lceil \log n \rceil$. If we just do and
undo the P-rounds and G-rounds, we can compare two $2^k$-bit numbers in depth
roughly $2k$. So, we can pad our $n$-bit numbers by adding zeros to the front,
and then use the compare circuit for $2^k$-bit numbers.

After we complement $a$, we will have $p[i,j] = 1$ for $j > i \ge n$ and
$g[i,j] = 0$ for $j > i \ge n$. We do not explicitly compute these values in our
circuit; effectively, we compile the values into the circuit.
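The comparison rests on a simple arithmetic fact: the carry out of $a' + b$ is 1 exactly when $a < b$. The sketch below checks this with ordinary integers; the function names are illustrative, and the final inversion mirrors the last negation in the comparator steps rather than any particular gate layout.

```python
def carry_out_of_complement_sum(a: int, b: int, n: int) -> int:
    """High bit (bit n) of a' + b, where a' is the n-bit complement of a."""
    mask = (1 << n) - 1
    return ((~a & mask) + b) >> n  # a' + b = 2**n - 1 - a + b


def is_greater_or_equal(a: int, b: int, n: int) -> int:
    """Return 1 if a >= b, else 0, by inverting the carry out of a' + b."""
    return 1 - carry_out_of_complement_sum(a, b, n)
```

Since $a' + b = 2^n + (b - a - 1)$, bit $n$ is set precisely when $b - a - 1 \ge 0$, that is, when $b > a$.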

Overall, the comparator uses $2n - \lfloor \log(n-1) \rfloor - 3$ ancillae. For
the explicit discussion, we suppose our input is stored in the $n$-long bit
arrays $A$ and $B$. We have $n - 1$ ancillary bits denoted
$Z[1], \ldots, Z[n-1]$, and $n - \lfloor \log(n-1) \rfloor - 2$ additional
ancillae denoted by $X$. The output bit is denoted $Z[n]$. We proceed as follows:

1. For $0 \le i < n$, negate $A[i]$.

2. For $0 \le i < n$, $Z[i+1] \oplus= A[i]B[i]$.

3. For $1 \le i < n$, $B[i] \oplus= A[i]$.

4. Do the P-rounds for the $2^k$-bit adder using space $X$; write only the
   values we cannot deduce at compile-time.

5. Do the G-rounds for the $2^k$-bit adder; apply only those gates which affect
   $Z[n]$.

6. Undo the G-round gates which did not write to $Z[n]$.

7. Do the $P^{-1}$-rounds for the $2^k$-bit adder, erasing $X$.

8. For $1 \le i < n$, $B[i] \oplus= A[i]$.

9. For $0 \le i < n - 1$, $Z[i+1] \oplus= A[i]B[i]$.

10. For $0 \le i < n$, negate $A[i]$. Also negate $Z[n]$.

Step 5 contains $n - 1$ Toffoli gates, in depth $\lfloor \log(n-1) \rfloor + 1$.
Step 6 is equivalent to inverting the G-rounds for an $(n-1)$-bit adder, and
contains $n - w(n-1) - 1$ Toffoli gates in depth $\lfloor \log(n-1) \rfloor$.

Steps 4 and 7 each consist of $n - \lfloor \log(n-1) \rfloor - 2$ gates. In each
case, the depth would be $\lfloor \log(n-1) \rfloor$, but each P-round after the
first (and each $P^{-1}$-round before the last) can be done in parallel with a
G-round.

The total depth for the comparator is

$$2\lfloor \log(n-1) \rfloor + 9,$$

where two of the time-slices contain negations, two contain controlled-NOTs,
and the rest are Toffolis. The overall circuit size is

$$6n - 2\lfloor \log(n-1) \rfloor - w(n-1) - 7$$

Toffoli gates, $2n - 2$ controlled-NOT gates, and $2n + 1$ negations. When
$n < 4$, we have slightly overcounted the depth and size. A sample comparison
circuit for $n = 7$ appears in Figure 6.

If we wish to allow an incoming carry, we use the same technique as in
Section 5.2. We use an $(n+1)$-bit comparator, except that the carry input
replaces one of the ancillae, and we save two negations and two Toffolis.

It may seem strange that an $n$-bit compare would require more gates than
an $n$-bit out-of-place add. After all, we're solving a simpler problem; we want
one bit of the $(n+1)$-bit answer.

One way to look at this phenomenon is that, when we compute the high bit
of the sum, we are effectively using other output bits as ancillary space. If
we're "only" doing a compare, we need extra gates to erase this space. One
explicit example of this is Step 9 of the compare, where we erase the generate
array. For the out-of-place add, the generate array has turned into our answer,
and need not be erased.

5.5 Addition (mod $2^n - 1$)

Recall from Section 2.1 that we have been working in two's-complement
arithmetic, where $r' + r = -1$. With a slight increase in depth, we can modify
our circuit to work in one's-complement arithmetic, where $r' + r = 0$.
Equivalently, we can view one's-complement addition as addition
(mod $2^n - 1$). This may prove useful for some applications, particularly when
$2^n - 1$ is prime.

Note that, in one's-complement arithmetic, zero can be represented either by
the all-zeros bit string $\mathbf{0}$ or the all-ones bit string $\mathbf{1}$.
For in-place reversible computation, we cannot have both $a + \mathbf{0} = a$
and $a + \mathbf{1} = a$. We will first describe our adder in general terms, and
then discuss how we can handle this zero problem.

First, consider the computation of $c_0$. In the one's-complement setting, we
can no longer assume $c_0$ to be 0; the low bit of the sum is affected by whether
or not we have an overflow. We have an overflow if and only if $a + b \ge 2^n$;
hence, we get $c_0 = g[0,n]$.

If $n = 2^k$ for some $k$, then we have computed $c_0$ at the end of the
G-rounds. How do we compute the other carry bits? One approach follows from
the cyclic invariance of addition (mod $2^n - 1$): We note that multiplication
by $2^j$ corresponds to a cyclic shift by $j$. So, if we could simultaneously
add at all possible cyclic shifts, we would compute all of the carry bits. This
approach would have logarithmic depth, but would require $O(n \log n)$
ancillary space.

Figure 6: QCLA comparator for 7 bits. P-rounds and $P^{-1}$-rounds are shown in
blue; G-rounds and $G^{-1}$-rounds are red.

A second idea is to view $c_0$ as an incoming carry $g[-\infty, 0]$. Our carry
string is then given by $c_i = g[-\infty, i]$. Another way to look at this
identity is that we are wrapping around: to compute $c_i$, we start at the zero
position, work up to $n$, and then wrap back around and keep going up to $i$.
This is the same as the cyclic shift, except that we are counting one region
twice; it is not hard to see that this cannot affect our overall answer.

To do this wrap-around, we will need propagate bits of the form $p[0, 2^t]$.
After we complete the P-rounds and G-rounds, we have computed
$c_0 = g[-\infty, 0]$. We now add a new round: using $c_0$ and $p[0, 2^{k-1}]$,
we use one Toffoli gate to compute $c_{2^{k-1}} = g[-\infty, 2^{k-1}]$. From
here on, we do our usual C-rounds, except that each contains one extra gate
computing $c_{2^t}$. Upon completion, we have successfully computed the carry
string.

If $n$ is not a power of 2, we need to do a bit more work to make sure we
compute $c_0$ at the end of the G-rounds. We use the same technique as in
Section 5.4.
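At the level of integers, the wrap-around amounts to feeding the overflow bit back in as the incoming carry. The following is a minimal sketch of that end-around-carry arithmetic; `add_mod_mersenne` is an illustrative name, and ordinary integer addition stands in for the circuit.

```python
def add_mod_mersenne(a: int, b: int, n: int) -> int:
    """One's-complement sum of two n-bit strings with end-around carry."""
    mask = (1 << n) - 1
    c0 = (a + b) >> n            # overflow bit, playing the role of g[0, n]
    return (a + b + c0) & mask   # wrap the overflow into the low position
```

The result agrees with $a + b$ modulo $2^n - 1$, although it may come out as the all-ones string when the true sum is zero; for example, with $n = 3$, adding 3 and 4 yields 7 rather than 0.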

5.5.1 Out-of-place addition (mod $2^n - 1$)

The above description, combined with the general approach in Section 4.1, yields
an out-of-place one's-complement adder. We produce $n$ bits of output, and use
$n - 2$ ancillae.

The overall depth is

$$2\lfloor \log(n-1) \rfloor + 8,$$

where three of the time-slices contain controlled-NOTs and the rest are Toffolis.
For $n < 2$, the depth is slightly lower.

The circuit contains $5n - 6$ Toffoli gates and $3n$ controlled-NOT gates.

Suppose we use the above circuit to add two numbers $a$ and $b$ which sum
to $2^n - 1$. Since $a \oplus b = \mathbf{1}$ (the all-ones string), the
generate array will be initialized to $\mathbf{0}$ (the all-zeros string) and
the propagate array to $\mathbf{1}$. Hence the carry string will be
$\mathbf{0}$, and the sum will be output as $\mathbf{1}$. Hence, we say that
this circuit uses the $\mathbf{1}$ representation of zero. One can check that,
if one of the inputs is $\mathbf{1}$, the circuit also adds correctly.^3

It might seem more natural to represent zero as $\mathbf{0}$, the all-zeros
string. We can modify the out-of-place circuit as follows: at the end of the
P-rounds, we XOR $p[0,n]$ into $c_0$ (this requires one additional Toffoli
gate). So, after the G-rounds, we will have $c_0 = g[0,n] \oplus p[0,n]$. We
then compute the C-rounds as before. Now, if $a \oplus b = \mathbf{1}$ (the
all-ones string), the circuit will output $\mathbf{0}$ as the sum; we have thus
given a circuit which uses the $\mathbf{0}$ representation of zero. Again, we
can check that, if one or both inputs are $\mathbf{0}$, the circuit performs
correctly.

The out-of-place one's-complement adder using the $\mathbf{0}$ representation
requires $n - 2$ ancillae. The circuit contains $5n - 5$ Toffoli gates and $3n$
controlled-NOT gates. For some $n$, the depth goes up by one; it depends on
whether the computation of $p[0,n]$ can be done simultaneously with the
penultimate G-round. For $n > 4$, the depth can be written as

$$\lfloor \log(n-1) \rfloor + \left\lfloor \log \frac{n-1}{3} \right\rfloor + 10,$$

where three of the time-slices contain controlled-NOTs and the rest are Toffolis.

^3 In fact, this circuit is also correct when exactly one of the inputs is
$\mathbf{0}$. But, if both inputs are $\mathbf{0}$, we output $\mathbf{0}$
rather than $\mathbf{1}$.

5.5.2 In-place addition (mod $2^n - 1$)

Following Section 4.2, we next construct an in-place one's-complement adder.
We require $2n - 2$ ancillary bits: $n - 2$ for computing the propagate bits,
and $n$ for the carry string. We first compute the carry string into our
ancillae, and then write the sum on top of $b$. Next, to erase the carry string,
we negate $b$, undo the addition computation, and fix $b$ at the end.

As in the out-of-place version, we need to be careful about the representation
of zero. The complementation of $b$ introduces a slight wrinkle: if we do our
first addition using the all-zeros string $\mathbf{0}$ to represent zero, then
we need to undo a $\mathbf{1}$-based addition (one using the all-ones string).
If we do our first addition using the $\mathbf{1}$ circuit, we need to undo the
$\mathbf{0}$ circuit.

Hence, regardless of whether we represent zero by $\mathbf{0}$ or $\mathbf{1}$,
the cost of the circuit is the same: we require $2n$ negations, $4n$
controlled-NOTs, and $10n - 11$ Toffolis. For $n > 4$, the depth is

$$3\lfloor \log(n-1) \rfloor + \left\lfloor \log \frac{n-1}{3} \right\rfloor + 18,$$

where two of the time-slices contain negations, four contain controlled-NOTs,
and the rest contain Toffolis.

Figure 7 depicts a sample in-place one's-complement QCLA adder for the case
$n = 7$.

6 Conclusions and future work 

In conclusion, we have developed an efficient addition circuit using classical
carry-lookahead techniques. Our QCLA adder sums two $n$-bit numbers in-place
using $2n - w(n) - \lfloor \log n \rfloor - 1$ ancillary qubits in depth
$4 \log n + O(1)$. This improves upon the previous best known addition
circuits, which require linear depth. Our work reduces the depth of the
addition step in the arithmetic circuits required in Shor's algorithm from
linear to logarithmic.

The complexities of the various circuits in this paper are summarized in
Tables 1 and 2. In Table 1, we assume $n = 2^k$; in Table 2, we give the general
formulas. For simplicity, we count only Toffoli gates and Toffoli time-slices.
Since some of the formulas are incorrect for small $n$, we assume $n > 7$. We
also include two different ripple-carry adders [7, 1].

It would be interesting to apply a similar tree-like approach to other
arithmetic problems, such as modular addition and multiplication. It would also
be interesting to build a logarithmic-depth addition circuit using only
$O(\log n)$ ancillae, or to prove that no such classical reversible circuit
exists.
Figure 7: In-place QCLA adder (mod $2^n - 1$) for $n = 7$. P-rounds and
$P^{-1}$-rounds are shown in blue. G-rounds are red, and C-rounds are green.
This adder uses the $\mathbf{0}$ (all-zeros) representation of zero. Note that
the extra gate writing $p[0,n]$ to $c_0$ is present only during the first half
of the computation.

Table 1: Circuit summary, for $n = 2^k$, where $k > 3$. The first column gives
the function being computed. The second lists whether the computation is done in
place, and the third lists whether we take an incoming carry bit as input. We
then list the number of input, output, and ancillae bits, and the circuit size
and depth. For the purposes of this table, we only count Toffoli gates.

Table 2: Circuit summary, for $n > 7$. The first column gives the function being
computed. The second lists whether the computation is done in place, and the
third lists whether we take an incoming carry bit as input. We then list the
number of input, output, and ancillae bits, and the circuit size and depth. For
the purposes of this table, we only count Toffoli gates. Recall that $w(n)$ is
the number of ones in the binary expansion of $n$.